
[Perf] Support Flashinfer RoPE+Quant+KV update kernel for trtllm_mha backend for GPT-OSS#15729

Open
elvischenv wants to merge 10 commits into sgl-project:main from elvischenv:elvischenv/gpt-oss_rope_quant_kv

Conversation

@elvischenv
Contributor

@elvischenv elvischenv commented Dec 24, 2025

Motivation

This PR adds support for the Flashinfer rope_quantize_fp8_append_paged_kv_cache kernel in the trtllm_mha backend and enables it for GPT-OSS.

Depends on a Flashinfer PR that fixes the piecewise CUDA graph compatibility issue: flashinfer-ai/flashinfer#2792
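Conceptually, the fused kernel collapses three previously separate steps into one pass over the new tokens: apply rotary embedding, quantize to FP8 e4m3, and write into the paged KV cache. The NumPy sketch below is a minimal reference for that data flow only; the function names, cache layout, and scaling are illustrative assumptions, not the Flashinfer API.

```python
# Reference sketch of rope -> fp8-quantize -> paged-KV-append in one pass.
# Names/layout are illustrative assumptions, NOT the Flashinfer kernel's API.
import numpy as np

def rope(x: np.ndarray, pos: int, theta: float = 10000.0) -> np.ndarray:
    """Apply rotary embedding to one token's head vector (rotate-half convention)."""
    half = x.shape[-1] // 2
    freqs = theta ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def quantize_fp8_e4m3(x: np.ndarray, scale: float) -> np.ndarray:
    """Simulate FP8 e4m3 quantization: scale, then clamp to e4m3's finite range."""
    return np.clip(x / scale, -448.0, 448.0)  # 448 is the e4m3 max finite value

def fused_rope_quant_append(k, pos, scale, kv_cache, page_table, page_size):
    """RoPE, quantize, and write into the paged KV cache in a single step."""
    k_q = quantize_fp8_e4m3(rope(k, pos), scale)
    kv_cache[page_table[pos // page_size], pos % page_size] = k_q
    return k_q
```

Doing all three in one kernel avoids materializing the rotated bf16 K and re-reading it for quantization and the cache write, which is where the TPOT gain comes from.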

Tested cmd

server:

SGLANG_ENABLE_FLASHINFER_ROPE_FUSION=1 \
sglang serve \
--model-path openai/gpt-oss-120b \
--tensor-parallel-size 8 \
--kv-cache-dtype fp8_e4m3 \
--max-running-requests 1024 \
--cuda-graph-max-bs 1024 \
--stream-interval 20 \
--disable-radix-cache \
--model-loader-extra-config '{"enable_multithread_load": true}'

server with eagle:

SGLANG_ENABLE_SPEC_V2=1 \
SGLANG_ENABLE_FLASHINFER_ROPE_FUSION=1 \
sglang serve \
--model-path openai/gpt-oss-120b \
--tensor-parallel-size 8 \
--kv-cache-dtype fp8_e4m3 \
--max-running-requests 1024 \
--cuda-graph-max-bs 1024 \
--stream-interval 20 \
--disable-radix-cache \
--model-loader-extra-config '{"enable_multithread_load": true}' \
--speculative-algorithm EAGLE3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-draft-model nvidia/gpt-oss-120b-Eagle3

client (accuracy):

OPENAI_API_KEY="test" \
python -m gpt_oss.evals \
--base-url http://127.0.0.1:30000/v1 \
--model openai/gpt-oss-120b \
--reasoning-effort high \
--n-threads 512 \
--eval aime25

client (benchmark, TP8, concurrency 8):

python3 -m sglang.bench_serving \
--model openai/gpt-oss-120b \
--backend sglang \
--dataset-name random \
--max-concurrency 8 \
--num-prompts 80 \
--random-input-len 1024 \
--random-output-len 1024 \
--random-range-ratio 1.0

Accuracy Results

PR

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20260325_193213', 'metric': 0.9125}]

PR with eagle

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20260325_210444', 'metric': 0.9083333333333333}]

main

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20260325_200520', 'metric': 0.9166666666666666}]

main with eagle

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20260325_212946', 'metric': 0.9041666666666667}]

Perf (GPT-OSS-120B, TP8, concurrency 8)

PR: about 7% perf gain

Median TPOT (ms):                        2.80

main

Median TPOT (ms):                        3.02

Eagle Accept length

PR:

Accept length:                           2.05

main:

Accept length:                           1.99

Modifications

  • trtllm_mha_backend.py: support the core rope_quantize_fp8_append_paged_kv_cache kernel
  • gpt_oss.py: defer RoPE into the attention backend
  • radix_attention.py: defer RoPE into the attention backend
  • environ.py: add SGLANG_ENABLE_FLASHINFER_ROPE_FUSION, disabled by default
  • test_trtllm_mha_backend.py: test the trtllm_mha backend, covering basic and RoPE-fusion functionality
  • test_gpt_oss_models_rope_fusion.py: test GPT-OSS end-to-end accuracy with RoPE fusion enabled
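The gpt_oss.py / radix_attention.py change ("defer RoPE into the attention backend") can be pictured roughly as below. The class and method names are hypothetical stand-ins, not sglang's actual interface: the model rotates q/k eagerly only when the backend cannot fuse RoPE, and otherwise passes unrotated q/k plus positions down so the fused kernel can rotate them itself.

```python
# Hypothetical sketch of deferring RoPE into the attention backend.
# Class/method names are illustrative, not sglang's real interface.

class BaseAttnBackend:
    def support_rope_fusion(self) -> bool:
        return False

    def forward(self, q, k, positions, apply_rope):
        # Unfused path: q/k arrive already rotated by the model.
        return ("unfused", q, k)

class TrtllmMHABackend(BaseAttnBackend):
    def support_rope_fusion(self) -> bool:
        return True

    def forward(self, q, k, positions, apply_rope):
        # Fused path: rotation happens inside the backend, standing in for
        # the rope_quantize_fp8_append_paged_kv_cache kernel.
        q, k = apply_rope(q, k, positions)
        return ("fused", q, k)

def model_attention(q, k, positions, apply_rope, backend):
    """Model-side dispatch: rotate eagerly only if the backend cannot fuse RoPE."""
    if not backend.support_rope_fusion():
        q, k = apply_rope(q, k, positions)
    return backend.forward(q, k, positions, apply_rope)
```

Either way the attention output is computed from rotated q/k; only where the rotation runs changes, which is what lets the fused kernel absorb it.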

Checklist


@github-actions github-actions bot added the blackwell SM100/SM120 label Dec 24, 2025
@elvischenv elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from 5e5c50f to 7cc00cb Compare February 7, 2026 12:42
@elvischenv elvischenv marked this pull request as ready for review February 7, 2026 12:43
@elvischenv elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from 7cc00cb to 191dcf2 Compare February 24, 2026 04:45
@elvischenv elvischenv requested a review from HaiShaw as a code owner February 24, 2026 04:45
@elvischenv elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from 191dcf2 to e59267f Compare February 26, 2026 03:34
@nvpohanh
Copy link
Collaborator

This can be reviewed together with #19451. They are very similar, except that one is for trtllm_mha and the other is for trtllm_mla.

@Fridge003
Copy link
Collaborator

For the accuracy results, which model are you testing on?
Can you please post accuracy results for MTP, to make sure its acceptance length doesn't drop?

        return None

    def support_rope_fusion(self) -> bool:
        """Check if the current backend supports RoPE fusion."""
Collaborator

Instead of adding this method to the base class, can we control this fusion with an environ flag?
It is now set to False by default; after the feature stabilizes, it can be turned on by default.
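The flag gating suggested here could look like the sketch below. Only the SGLANG_ENABLE_FLASHINFER_ROPE_FUSION variable comes from the PR; the helper names and the default value of "0" are assumptions for illustration.

```python
import os

def rope_fusion_enabled() -> bool:
    # Hypothetical helper: disabled unless the user explicitly opts in;
    # the default could be flipped once the feature has stabilized.
    return os.environ.get("SGLANG_ENABLE_FLASHINFER_ROPE_FUSION", "0") == "1"

def use_rope_fusion(backend) -> bool:
    # Both the env flag and the backend capability must agree before the
    # fused kernel path is taken.
    return rope_fusion_enabled() and backend.support_rope_fusion()
```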

@elvischenv elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from e59267f to dc6a0ac Compare March 26, 2026 03:48
@elvischenv elvischenv requested a review from Ying1123 as a code owner March 26, 2026 03:48
@elvischenv
Contributor (Author)

@Fridge003 Updated the testing results in the PR description. This PR currently depends on a Flashinfer PR flashinfer-ai/flashinfer#2792 to fix the compatibility issue with piecewise cudagraph.


Labels

blackwell SM100/SM120
